SASSC: a standard Arabic single speaker corpus
Abstract
This paper describes the process of collecting and recording a large-scale Arabic single-speaker speech corpus. Professional linguists supervised the collection and recording of the corpus, which was read by a professional speaker in a soundproof studio using specialized equipment and stored in high-quality formats. The pitch of the speaker (EGG) was also recorded and synchronized with the speech signal. Care was taken to ensure the quality and diversity of the read text, so that it covers as many words, phonemes, and combinations of them as possible. The corpus consists of 51 thousand words that required 7 hours of recording, and it is freely available for academic and research purposes.
Similar resources
Language Variation as a Context for Information Retrieval
Speakers of widespread languages may encounter problems in information retrieval and document understanding when they access documents in the same language from another country. The work described here focuses on the development of resources to support improved document retrieval and understanding by users of Modern Standard Arabic (MSA). The lexicon of an Egyptian Arabic speaker and the lexico...
Crowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic
Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognit...
Testing a large corpus of natural standard Arabic for rhythm class
Previous studies using acoustic correlates to measure speech rhythm have used small samples of audio and a limited number of speakers. Few have included standard Arabic in the analysis. This study uses Arabic news broadcast along with data output from an automatic speech recognizer time-aligned transcript to test over 50 minutes of speech by 46 speakers. The results show that Arabic, like Englis...
A new method for extracting named entities in Classical Arabic
In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has been considered by NLP researchers due to its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...
Constrained Cepstral Speaker Recognition Using Matched UBM and JFA Training
We study constrained speaker recognition systems, or systems that model standard cepstral features that fall within particular types of speech regions. A question in modeling such systems is whether to constrain universal background model (UBM) training, joint factor analysis (JFA), or both. We explore this question, as well as how to optimize UBM model size, using a corpus of Arabic male speak...